DATA SCIENCE SESSIONS VOL. 3¶

A Foundational Python Data Science Course¶

Session 15: Regularization in MLR. The Maximum Likelihood Estimation (MLE).¶


Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers¶

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


Let's first import all of the necessary libraries for later.

In [ ]:
### --- Setup - importing the libraries

# - suppress those annoying 'FutureWarning' messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


# - data
import numpy as np
import pandas as pd

# - os
import os

# - ml
from scipy.stats import norm
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor


# - visualization
import matplotlib.pyplot as plt
import seaborn as sns

# - sklearn
from sklearn import linear_model
from sklearn.linear_model import LinearRegression

# - parameters
%matplotlib inline
pd.options.mode.chained_assignment = None  # default='warn'
sns.set_theme()

# - rng
rng = np.random.default_rng(seed=513)

# - plots
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set_theme(style='white')

# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')

0. The Problem of Overfitting¶

Overfitting is a common problem in machine learning where a model learns to fit the training data too closely, leading to poor performance on new or unseen data. In other words, the model becomes too complex and starts to memorize the training data instead of learning the underlying patterns and relationships.

Overfitting occurs when a model is trained on a limited set of data, and the model is too complex, with too many parameters or features, to generalize to new data. This can lead to a model that performs very well on the training data but poorly on new data.

Overfitting can be identified by comparing the model's performance on the training data to its performance on a validation set or test data. If the model performs significantly better on the training data than on the validation set, then it is likely overfitting.

To prevent overfitting, one can use techniques such as regularization, early stopping, or reducing the complexity of the model. Regularization techniques, such as L1 or L2 regularization, can penalize the model for having too many features or high parameter values, while early stopping can stop the training process before the model starts to overfit. Additionally, reducing the complexity of the model, such as by decreasing the number of features or using a simpler model architecture, can also help prevent overfitting.
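The train-versus-test gap described above is easy to demonstrate. Here is a minimal sketch on synthetic data invented just for this illustration (not the data used below): we fit a straight line and a degree-12 polynomial to 15 noisy points generated by a linear model, then score both on fresh test data.

```python
# - overfitting in miniature: compare a line and a degree-12 polynomial
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

gen = np.random.default_rng(42)

# - 15 noisy training points from a linear model, plus fresh test points
x_train = gen.uniform(0, 1, 15).reshape(-1, 1)
y_train = 2 * x_train.ravel() + gen.normal(0, 0.3, 15)
x_test = gen.uniform(0, 1, 200).reshape(-1, 1)
y_test = 2 * x_test.ravel() + gen.normal(0, 0.3, 200)

r2 = {}
for degree in (1, 12):
    poly = PolynomialFeatures(degree=degree)
    lr = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    # - score on the training data and on unseen data
    r2[degree] = (lr.score(poly.transform(x_train), y_train),
                  lr.score(poly.transform(x_test), y_test))
    print(f"degree {degree:2d}: train R2 = {r2[degree][0]:.3f}, "
          f"test R2 = {r2[degree][1]:.3f}")
```

The degree-12 model scores higher on the training points, but its advantage shrinks or reverses on the test points: it has memorized the noise.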

Let's, for the sake of the example, generate a sample from a bivariate distribution.

In [ ]:
# generate x
x = norm.rvs(loc=15, scale = 10, size = 1000000)

# y is a noisy linear transformation of x:
error_y = norm.rvs(loc=0, scale = 4, size = 1000000)

# set intercept and slope
intercept_y = 7.5
slope_y = .78

y = intercept_y + slope_y*x + error_y

data = pd.DataFrame({'x':x, 'y':y})

data
Out[ ]:
x y
0 27.402672 27.710456
1 16.747171 21.483032
2 14.356906 22.211644
3 -1.515171 7.705193
4 1.214776 3.510677
... ... ...
999995 24.478788 29.751996
999996 14.713014 13.542496
999997 17.390143 19.777645
999998 17.123090 29.303403
999999 10.531150 20.055937

1000000 rows × 2 columns

Now, shall we plot the values of y as a function of x?

In [ ]:
# create a scatterplot of x and y
plt.scatter(data['x'], data['y'])

# create a linear regression model
lr = LinearRegression()
lr.fit(data[['x']], data['y'])

# create a range of x values to plot the regression line
x_range = np.linspace(data['x'].min(), data['x'].max(), 100)

# predict y values using the linear regression model and x_range values
y_pred = lr.predict(x_range.reshape(-1,1))

# add the regression line to the scatterplot
plt.plot(x_range, y_pred, color='red')

# r2 score
r2 = lr.score(data[['x']], data['y'])

# add labels and title to the plot
plt.xlabel('x')
plt.ylabel('y')
plt.title(f'Scatterplot of x and y with Regression Line (R2={r2:.3f})')

# show the plot
plt.show()

This is all fine. However, how about we declare this bivariate distribution to be a population, draw 10,000 samples of 100 observations each, and do the same thing, i.e. estimate the intercept and slope parameters of the linear model for each sample?

We will store the parameter values, together with the value of R2, in a dedicated Pandas DataFrame for easier display.

In [ ]:
# create an empty DataFrame to store the regression results
results_df = pd.DataFrame(columns=['Beta_0', 'Beta_1', 'R2'])

# perform 10000 linear regressions on samples of size 100
for i in range(10000):
    # take a sample of size 100 without replacement
    sample = data.sample(n=100, replace=False)
    
    # perform linear regression on the sample
    lr = LinearRegression()
    lr.fit(sample[['x']], sample['y'])
    
    # extract the intercept, slope, and R2 from the linear regression
    beta_0 = lr.intercept_
    beta_1 = lr.coef_[0]
    r2 = lr.score(sample[['x']], sample['y'])
    
    # add the results to the results DataFrame
    results_df.loc[i] = [beta_0, beta_1, r2]

# display the results DataFrame
results_df
Out[ ]:
Beta_0 Beta_1 R2
0 7.518308 0.803567 0.786853
1 7.569126 0.752167 0.806824
2 8.680155 0.731020 0.770505
3 6.683363 0.848902 0.807314
4 7.305882 0.833274 0.820135
... ... ... ...
9995 7.866276 0.741158 0.748254
9996 9.405935 0.708189 0.725259
9997 7.397360 0.781120 0.821569
9998 7.146648 0.792564 0.853441
9999 7.174686 0.804436 0.797844

10000 rows × 3 columns

What is the distribution of the R2 metric across all of the samples?

In [ ]:
results_df.hist('R2', bins=100);

And what did each of the linear regression models look like? Here are the linear regression models for the first 12 samples, together with the observations.

In [ ]:
# take 12 random samples of size 100 without replacement
samples = [data.sample(n=100, replace=False) for i in range(12)]

# create a 4 x 3 grid of subplots
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(12, 9))

# loop through each subplot and plot the scatterplot and regression line for one sample
for i, ax in enumerate(axes.flat):
    # perform linear regression on the sample
    lr = LinearRegression()
    lr.fit(samples[i][['x']], samples[i]['y'])
    
    # plot the scatterplot
    sns.scatterplot(x='x', y='y', data=samples[i], ax=ax)
    
    # plot the regression line
    sns.lineplot(x=samples[i]['x'], y=lr.predict(samples[i][['x']]), color='red', ax=ax)
    
    # set the title of the subplot to the R2 score
    ax.set_title('R2: {:.2f}'.format(lr.score(samples[i][['x']], samples[i]['y'])))
    
# set the overall title of the plot
fig.suptitle('Scatterplots with Regression Lines for 12 Samples')

# adjust the spacing between subplots
plt.tight_layout()

# show the plot
plt.show()

Note how the intercepts and slopes differ from sample to sample, and how R2 varies as well. Why? We have sampled from the same 'population' every time: what we see is pure sampling variability.

1. Regularization in Multiple Linear Regression¶

1.1 Lasso (L1), Ridge (L2), and Elastic Net regression¶

Regularization is used for several purposes:

  • it helps reduce the problem of multicollinearity in Linear Regression,
  • it is used as a method of feature selection in large models, and
  • it is used as a method to prevent overfitting.

Consider the following:

$$Loss = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}$$

It is just the ordinary sum of squares, i.e. the loss (or cost) function of the Linear Regression Model. The parameters $\beta$ of the Linear Model are estimated by minimizing the above quantity, the model's cost function. The expression above is just the ordinary $SSE$ function as we have encountered it before in our sessions on Linear Regression.

Now, consider the following formulation of the cost function that includes the penalty term:

$$Loss_{L2} = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}+\lambda\sum_{j=1}^{p}\beta_j^2$$

$\lambda\sum_{j=1}^{p}\beta_j^2$ is the penalty term: it increases the value of the cost function by a factor determined by the sum of the squared model coefficients. In the expression above $p$ stands for the number of model parameters (coefficients, in the case of Linear Regression) and $\beta$ are the coefficients themselves. The above expression represents the $L2$ regularization for the Linear Model, also known as Ridge Regression. If $\lambda$ is zero we get back to the Ordinary Least Squares model that we have already learned about. If $\lambda$ is too large, it will add too much weight to the cost function which will lead to underfitting.

The underlying idea is to penalize large coefficients: the cost function is penalized whenever the coefficients take large values. In effect, Ridge regression shrinks the coefficients so as to reduce model complexity, and it mitigates the effects of multicollinearity.

Consider now the following modification of the cost function:

$$Loss_{L1} = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}+\lambda\sum_{j=1}^{p}|\beta_j|$$

The above cost function is based on a weighted sum of the absolute values of the model coefficients. This approach is known as $L1$ or Lasso Regression.

The most important difference between Ridge and Lasso regularization is that in Lasso some coefficients can be shrunk to exactly zero and thus completely eliminated from the model. Lasso regularization therefore not only prevents overfitting but also performs feature selection, a very handy thing when working with large linear models.
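Here is a minimal sketch of that zeroing effect, on synthetic data invented for this illustration: the response depends on only the first two of five predictors.

```python
# - Lasso as feature selection: irrelevant coefficients are driven to exactly 0
import numpy as np
from sklearn.linear_model import Lasso

gen = np.random.default_rng(0)
X = gen.normal(size=(500, 5))
# - y depends only on the first two columns of X
y = 3 * X[:, 0] - 2 * X[:, 1] + gen.normal(scale=0.5, size=500)

lasso = Lasso(alpha=0.5).fit(X, y)
print(np.round(lasso.coef_, 3))
```

With this moderate alpha the three irrelevant coefficients come out exactly zero, while the two relevant ones survive (shrunk toward zero by roughly alpha).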

Finally, consider the following formulation of a loss (or cost) function:

$$Loss_{ElasticNet} = \frac{\sum_{i=1}^{n}(y_i-\hat{y_i})^2}{2n}+\lambda\Big(\frac{1-\alpha}{2}\sum_{j=1}^{p}\hat{\beta_j}^2+\alpha\sum_{j=1}^{p}|\hat{\beta_j}|\Big)$$

In the expression above - the Elastic Net regression - $\alpha$ is the mixing parameter between Lasso ($\alpha=1$) and Ridge ($\alpha=0$) approach. While the Ridge regression shrinks the coefficients of correlated predictors towards each other, the Lasso regression tends to pick one of them and decrease the others. The Elastic Net penalty is a mixture of the two approaches controlled by $\alpha$.
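In scikit-learn's ElasticNet, the mixing parameter $\alpha$ from the formula above is called l1_ratio, while the overall penalty strength $\lambda$ is called alpha. A small sketch on synthetic data invented for this illustration: moving l1_ratio toward 1, the Lasso end, drives more coefficients exactly to zero.

```python
# - Elastic Net: l1_ratio mixes the Lasso (1.0) and Ridge (0.0) penalties
import numpy as np
from sklearn.linear_model import ElasticNet

gen = np.random.default_rng(0)
X = gen.normal(size=(500, 5))
# - y depends only on the first two columns of X
y = 3 * X[:, 0] - 2 * X[:, 1] + gen.normal(scale=0.5, size=500)

n_zero = {}
for l1_ratio in (0.1, 0.5, 1.0):
    en = ElasticNet(alpha=1.0, l1_ratio=l1_ratio).fit(X, y)
    n_zero[l1_ratio] = int(np.sum(en.coef_ == 0.0))
    print(f"l1_ratio={l1_ratio}: coefs={np.round(en.coef_, 2)}, "
          f"zeroed={n_zero[l1_ratio]}")
```

At l1_ratio=1.0 the model is a pure Lasso and eliminates all three irrelevant predictors.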

1.2 Inspect Multicollinearity in one particular Linear Regression problem¶

Here we are going to use the King County (Washington State, USA) housing data set, which is well suited for predicting house prices using multiple linear regression (MLR).

In [ ]:
data_set = pd.read_csv(os.path.join(data_dir, 'kc_house_data.csv'))
data_set.head(10)
Out[ ]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503
5 7237550310 20140512T000000 1225000.0 4 4.50 5420 101930 1.0 0 0 ... 11 3890 1530 2001 0 98053 47.6561 -122.005 4760 101930
6 1321400060 20140627T000000 257500.0 3 2.25 1715 6819 2.0 0 0 ... 7 1715 0 1995 0 98003 47.3097 -122.327 2238 6819
7 2008000270 20150115T000000 291850.0 3 1.50 1060 9711 1.0 0 0 ... 7 1060 0 1963 0 98198 47.4095 -122.315 1650 9711
8 2414600126 20150415T000000 229500.0 3 1.00 1780 7470 1.0 0 0 ... 7 1050 730 1960 0 98146 47.5123 -122.337 1780 8113
9 3793500160 20150312T000000 323000.0 3 2.50 1890 6560 2.0 0 0 ... 7 1890 0 2003 0 98038 47.3684 -122.031 2390 7570

10 rows × 21 columns

What is the nature of our data set?

In [ ]:
data_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB

Target: predict price from the numerical predictors¶

Let's use just a subset of the given variables as predictors.

In [ ]:
# - for simplicity we will drop some variables from the dataset
model_frame = data_set.drop(columns = ['id', 'lat', 'long', 'date',
                                      'zipcode', 'yr_renovated', 'waterfront',
                                      'view', 'sqft_basement'])
model_frame.head()
Out[ ]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_above yr_built sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 3 7 1180 1955 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 3 7 2170 1951 1690 7639
2 180000.0 2 1.00 770 10000 1.0 3 6 770 1933 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 5 7 1050 1965 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 3 8 1680 1987 1800 7503

We have to exclude the predicted (target) variable from the set of predictors.

In [ ]:
predictors = model_frame.columns.drop('price')
predictors
Out[ ]:
Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'condition', 'grade', 'sqft_above', 'yr_built', 'sqft_living15',
       'sqft_lot15'],
      dtype='object')
In [ ]:
# - the correlations of the predictors
plt.figure(figsize=(10, 5))
sns.heatmap(model_frame[predictors].corr(), fmt='.2f', annot=True)
plt.title('Pearson correlations between predictor variables.', size=15, y=1.03);

Let's define the formula for our model from the chosen variable names.

In [ ]:
# - right side of the formula
formula = ' + '.join(predictors)

# - left side of the formula
formula = 'price ~ ' + formula

formula
Out[ ]:
'price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + condition + grade + sqft_above + yr_built + sqft_living15 + sqft_lot15'

And now fit the model using OLS.

In [ ]:
# - fitting the linear model to the data
mlr = smf.ols(formula=formula, data=model_frame).fit()
mlr.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: price R-squared: 0.620
Model: OLS Adj. R-squared: 0.620
Method: Least Squares F-statistic: 3202.
Date: Fri, 28 Apr 2023 Prob (F-statistic): 0.00
Time: 15:53:01 Log-Likelihood: -2.9715e+05
No. Observations: 21613 AIC: 5.943e+05
Df Residuals: 21601 BIC: 5.944e+05
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.933e+06 1.36e+05 51.002 0.000 6.67e+06 7.2e+06
bedrooms -4.936e+04 2108.805 -23.406 0.000 -5.35e+04 -4.52e+04
bathrooms 5.01e+04 3625.970 13.817 0.000 4.3e+04 5.72e+04
sqft_living 195.1515 4.812 40.552 0.000 185.719 204.584
sqft_lot 0.0111 0.054 0.207 0.836 -0.094 0.116
floors 3.348e+04 3957.258 8.461 0.000 2.57e+04 4.12e+04
condition 1.898e+04 2580.038 7.356 0.000 1.39e+04 2.4e+04
grade 1.246e+05 2346.774 53.099 0.000 1.2e+05 1.29e+05
sqft_above -28.6643 4.678 -6.128 0.000 -37.833 -19.496
yr_built -3968.3906 69.774 -56.875 0.000 -4105.152 -3831.629
sqft_living15 36.2656 3.735 9.709 0.000 28.944 43.587
sqft_lot15 -0.5066 0.082 -6.172 0.000 -0.667 -0.346
Omnibus: 17771.672 Durbin-Watson: 1.984
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1362466.098
Skew: 3.467 Prob(JB): 0.00
Kurtosis: 41.274 Cond. No. 4.47e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.47e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [ ]:
# - R^2 of the linear model
mlr.rsquared
Out[ ]:
0.6198708417378094

It is always good to include the VIF (variance inflation factor) information for the predictors, for a better overview of multicollinearity.

In [ ]:
# - computing VIFs
model_frame_predictors = sm.add_constant(model_frame[predictors])
vifs = [variance_inflation_factor(model_frame_predictors.values, i) for i in range(1, len(predictors)+1)]
vifs = np.array(vifs).reshape(1, -1)
pd.DataFrame(vifs, columns=predictors)
Out[ ]:
bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_above yr_built sqft_living15 sqft_lot15
0 1.621833 3.287936 8.236167 2.085978 1.925165 1.188449 3.208241 6.325912 1.770933 2.763124 2.117307

1.3 Regularization of the model coefficients: Ridge (L2) Regularization¶

Now we are going to use scikit-learn to perform Regularized Multiple Linear Regression; we'll be using the Ridge Regression.

As we have seen, the Ridge Regression model is obtained by minimizing the function

$$SSE + \alpha L_2^2,$$

where $SSE$ is the Sum of Squared Errors of the 'ordinary' Multiple Linear Regression

$$\hat{y} = \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k + \beta_0,$$

and $L_{2}^2$ is the squared $L_{2}-$ norm of Multiple Linear Regression model parameters $\beta_1,\beta_2,\ldots,\beta_k$:

$$L_2^2 = \beta_1^2 + \beta_2^2 + \cdots + \beta_k^2,$$

and hence the name L2 Regularization. $\alpha$ is called the penalization (hyper)parameter, and it is used to control the magnitude of the model's parameters; for $\alpha = 0$ we recover the ordinary MLR. (Note that scikit-learn calls this parameter alpha; it plays the role of $\lambda$ in the formulas of the previous section.)
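Ridge regression also has a closed-form solution, $\hat{\beta} = (X^TX + \alpha I)^{-1}X^Ty$. The sketch below, on synthetic data invented for this illustration, verifies that this formula, applied to centered data because sklearn's Ridge does not penalize the intercept, reproduces linear_model.Ridge:

```python
# - Ridge closed form: beta = (X'X + alpha*I)^{-1} X'y on centered data
import numpy as np
from sklearn.linear_model import Ridge

gen = np.random.default_rng(2)
X = gen.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + gen.normal(size=200)

alpha = 10.0
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(4), Xc.T @ yc)

ridge = Ridge(alpha=alpha).fit(X, y)
print(np.allclose(beta, ridge.coef_))
```

The $+\,\alpha I$ term also explains why Ridge helps with multicollinearity: it makes $X^TX$ well-conditioned even when predictors are nearly linearly dependent.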

In [ ]:
model_frame.head()
Out[ ]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_above yr_built sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 3 7 1180 1955 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 3 7 2170 1951 1690 7639
2 180000.0 2 1.00 770 10000 1.0 3 6 770 1933 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 5 7 1050 1965 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 3 8 1680 1987 1800 7503
In [ ]:
# - checking if all the variables are numerical
model_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          21613 non-null  float64
 1   bedrooms       21613 non-null  int64  
 2   bathrooms      21613 non-null  float64
 3   sqft_living    21613 non-null  int64  
 4   sqft_lot       21613 non-null  int64  
 5   floors         21613 non-null  float64
 6   condition      21613 non-null  int64  
 7   grade          21613 non-null  int64  
 8   sqft_above     21613 non-null  int64  
 9   yr_built       21613 non-null  int64  
 10  sqft_living15  21613 non-null  int64  
 11  sqft_lot15     21613 non-null  int64  
dtypes: float64(3), int64(9)
memory usage: 2.0 MB

Split data into two parts, predictor and predicted variable data sets.

In [ ]:
# - feature matrix
X = model_frame[predictors].values

# - target vector
y = model_frame['price'].values

Now let's fit sklearn's Ridge model to perform Ridge regression.

In [ ]:
### --- Fitting a Ridge MLR Regularized model on the given data, with the penalization parameter \alpha = 10^4
mlr_ridge = linear_model.Ridge(alpha=1e+4)
mlr_ridge.fit(X, y)
Out[ ]:
Ridge(alpha=10000.0)
In [ ]:
# - model intercept
mlr_ridge.intercept_
Out[ ]:
5945635.344976561
In [ ]:
# - model coefficients; the order corresponds to the order of coefficients given by the statsmodels 'ordinary' MLR
mlr_ridge.coef_
Out[ ]:
array([-2.91771401e+04,  1.81662857e+04,  2.26257901e+02,  1.41002993e-02,
        1.59074953e+04,  7.12968872e+03,  6.53923517e+04, -5.50894014e+00,
       -3.28129969e+03,  6.05780655e+01, -6.44587129e-01])
In [ ]:
# - R^2 score of the Ridge Regularization
mlr_ridge.score(X, y)
Out[ ]:
0.6013062798111166

How about we change the alpha value and see what happens with model parameters?

In [ ]:
### --- Varying the penalization parameter for the Ridge Regularized Model
n_alphas = 200
alphas = np.logspace(-2, 10, n_alphas)

coefs = []
scores = []
mserrs = []
l2_norms = []
for a in alphas:
    # - fitting a Ridge Regularization
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(X, y)
    
    # - R^2 score of a fitted Regularized Model
    R2 = ridge.score(X, y)
    
    # - MSE of a model
    mse = np.mean((y - ridge.predict(X))**2)
    
    # - squared L2 norm of model coefficients
    l2 = np.linalg.norm(ridge.coef_, ord=2)**2
    
    coefs.append(ridge.coef_)
    scores.append(R2)
    mserrs.append(mse)
    l2_norms.append(l2)
In [ ]:
### --- Plotting the dependence of model parameters vs. the penalization parameter
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
ax.axvline(x=1, c='r', ls='--') # red line, corresponding to \alpha = 1
ax.axvline(x=10000, c='g', ls='--') # green line, corresponding to \alpha = 10^4
plt.xlabel("alpha")
plt.ylabel("coefficient")
plt.title("Ridge coefficients as a function of the regularization")
plt.axis("tight")
plt.legend(predictors)
plt.show()

And how does varying alpha affect the squared $L_2$ norm, the MSE and the $R^2$ score?

In [ ]:
### --- Plotting the dependence of various model metrics vs. the penalization hyperparameter
fig, ax = plt.subplots(1, 3, figsize=(24, 6), sharex=True)


# - squared L2-norm vs. \alpha
ax[0].plot(alphas, l2_norms)
ax[0].set_xscale("log")
ax[0].set_xlim(ax[0].get_xlim()[::-1])  # reverse axis
ax[0].axvline(x=1, c='r', ls='--') # red line
ax[0].axvline(x=10000, c='g', ls='--') # green line
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("$L_2$ norm$^2$")
ax[0].set_title("$L_2$ norm$^2$ as a function of the regularization")
ax[0].axis("tight")

# - MSE vs. \alpha
ax[1].plot(alphas, mserrs)
ax[1].set_xscale("log")
ax[1].axvline(x=1, c='r', ls='--') # red line
ax[1].axvline(x=10000, c='g', ls='--') # green line
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("MSE")
ax[1].set_title("MSE as a function of the regularization")
ax[1].axis("tight")

# - R^2 vs. \alpha
ax[2].plot(alphas, scores)
ax[2].set_xscale("log")
ax[2].axvline(x=1, c='r', ls='--') # red line
ax[2].axvline(x=10000, c='g', ls='--') # green line
ax[2].set_xlabel("alpha")
ax[2].set_ylabel("$R^2$")
ax[2].set_title("$R^2$ score as a function of the regularization")
ax[2].axis("tight");

We see that the best in-sample fit is obtained with a small penalization hyperparameter, i.e. with large model parameter (coefficient) values. Keep in mind, though, that we score the models on the same data we fit them on, so this tells us nothing about generalization to new data.

1.4 LASSO (L1) Regularization¶

LASSO is another way to regularize MLR model. It's obtained by minimizing the function

$$SSE + \alpha L_1,$$

where $SSE$ is the Sum of Squared Errors of the ordinary Multiple Linear Regression and $L_1$ is the $L_1$ norm of its coefficients, given by

$$L_1 = |\beta_1| + |\beta_2| + \cdots + |\beta_k|,$$

and hence the name L1 Regularization. $\alpha$ is, of course, the penalization hyperparameter.
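For a single standardized predictor and no intercept, the Lasso solution even has a closed form: soft-thresholding of the OLS estimate, $w = \mathrm{sign}(\hat{b})\max(|\hat{b}|-\alpha, 0)$. A sketch on synthetic data, using sklearn's parameterization of the Lasso objective, $\frac{1}{2n}\lVert y - Xw\rVert^2 + \alpha\lVert w\rVert_1$:

```python
# - 1D Lasso = soft-thresholding of the OLS estimate
import numpy as np
from sklearn.linear_model import Lasso

gen = np.random.default_rng(3)
x = gen.normal(size=1000)
x = x / np.sqrt((x ** 2).mean())        # enforce (1/n) * sum(x^2) = 1
y = 1.5 * x + gen.normal(size=1000)

b_ols = (x @ y) / x.size                # OLS slope of the no-intercept model
for alpha in (0.1, 1.0, 2.0):
    lasso = Lasso(alpha=alpha, fit_intercept=False).fit(x.reshape(-1, 1), y)
    soft = np.sign(b_ols) * max(abs(b_ols) - alpha, 0.0)
    print(f"alpha={alpha}: lasso={lasso.coef_[0]:.4f}, soft-threshold={soft:.4f}")
```

For alpha larger than $|\hat{b}|$ the coefficient is exactly zero; this is the mechanism behind the feature selection property discussed above.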

Here we use sklearn's Lasso model to fit the Lasso-regularized model.

In [ ]:
### --- Varying the penalization parameter for the LASSO Regularized Model
n_alphas = 200
alphas = np.logspace(-2, 10, n_alphas)

coefs = []
scores = []
mserrs = []
l1_norms = []
for a in alphas:
    # - fitting a LASSO Regularization
    lasso = linear_model.Lasso(alpha=a)
    lasso.fit(X, y)
    
    # - R^2 score of a fitted Regularized Model
    R2 = lasso.score(X, y)
    
    # - MSE of a model
    mse = np.mean((y - lasso.predict(X))**2)
    
    # - L1 norm of model coefficients
    l1 = np.linalg.norm(lasso.coef_, ord=1)
    
    coefs.append(lasso.coef_)
    scores.append(R2)
    mserrs.append(mse)
    l1_norms.append(l1)
In [ ]:
### --- Plotting the dependence of model parameters vs. the penalization parameter
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis, 
ax.axvline(x=1, c='r', ls='--') # red line, corresponding to \alpha = 1
plt.xlabel("alpha")
plt.ylabel("weights")
plt.title("LASSO coefficients as a function of the regularization")
plt.axis("tight")
plt.legend(predictors)
plt.show()

And how does varying alpha affect the $L_1$ norm, the MSE and the $R^2$ score?

In [ ]:
### --- Plotting the dependence of various model metrics vs. the penalization hyperparameter
fig, ax = plt.subplots(1, 3, figsize=(24, 6), sharex=True)

# - L1-norm vs. \alpha
ax[0].plot(alphas, l1_norms)
ax[0].set_xscale("log")
ax[0].set_xlim(ax[0].get_xlim()[::-1])  # reverse axis
ax[0].axvline(x=1, c='r', ls='--') # red line
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("L1 norm")
ax[0].set_title("L1 norm as a function of the regularization")
ax[0].axis("tight")

# - MSE vs. \alpha
ax[1].plot(alphas, mserrs)
ax[1].set_xscale("log")
ax[1].axvline(x=1, c='r', ls='--') # red line
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("MSE")
ax[1].set_title("MSE as a function of the regularization")
ax[1].axis("tight")

# - R^2 vs. \alpha
ax[2].plot(alphas, scores)
ax[2].set_xscale("log")
ax[2].axvline(x=1, c='r', ls='--') # red line
ax[2].set_xlabel("alpha")
ax[2].set_ylabel("$R^2$")
ax[2].set_title("$R^2$ score as a function of the regularization")
ax[2].axis("tight");

Again, we see that the best in-sample fit is obtained with a small penalization hyperparameter, i.e. with large coefficient values.

2. Maximum Likelihood Estimator (MLE)¶

Maximum Likelihood Estimation (MLE) is yet another approach to estimating model parameters. It is a way of estimating the parameter(s) of the probability distribution that generated the sample we have at hand. Before showing examples of how parameter values are estimated using MLE, let's refresh some of the things we talked about in session 11.

Back then, we introduced something called Bayes' Theorem. Given two events $A$ and $B$, and given that $P(A)\neq0$, we can define:

$$ P(B|A) = \frac{P(A|B)P(B)}{P(A)}.$$

Also, we said that $P(A|B)$ is also called the likelihood, and we introduced another notation for it, namely $\mathcal{L}(B|A)$. If you remember, we said that:

$$\mathcal{L}(B|A) = P(A|B).$$

So, the likelihood of the event $B$ given that the event $A$ has occurred is equal to the conditional probability of the event $A$ occurring under the condition that the event $B$ has occurred.

If we speak in terms of a random variable and its distribution parameters, we usually write:

$$P(\theta; X) = \frac{\mathcal{L}(\theta; X)P(\theta)}{P(X)}$$

and

$$\mathcal{L}(\theta; X) = P(X; \theta).$$

where $X$ is the RV and $\theta$ are its distribution's parameters (when there is a single parameter, $\theta$ is a scalar; when there are multiple parameters, $\theta$ is a vector).

Now, imagine we have a sample of 3 observations $x_1, x_2, x_3$ of the RV $X$. In this case, the previously rewritten Bayes' Theorem would be:

$$P(\theta; x_1, x_2, x_3) = \frac{\mathcal{L}(\theta; x_1, x_2, x_3)P(\theta)}{P(x_1, x_2, x_3)}.$$

If we focus just on $\mathcal{L}(\theta; x_1, x_2, x_3)$ term, and use already defined equality $\mathcal{L}(\theta; X) = P(X; \theta)$, knowing that $x_1, x_2$ and $x_3$ are statistically independent, we have:

$$\mathcal{L}(\theta; x_1, x_2, x_3) = P(x_1, x_2, x_3; \theta) = P(x_1; \theta) \cdot P(x_2; \theta) \cdot P(x_3; \theta).$$

In general, if we have a sample of $N$ observations, we have:

$$\mathcal{L}(\theta; x_1, x_2,..., x_N) = \prod_{i=1}^{N}{P(x_i; \theta)}. $$

In practice, this product of many probabilities (each between 0 and 1) quickly underflows and becomes computationally intractable. This is the reason why we define the log-likelihood as:

$$\ell(\theta; x_1, x_2, \ldots, x_N) = \log\Big(\prod_{i=1}^{N}{P(x_i; \theta)}\Big) = \sum_{i=1}^{N}{\log\big(P(x_i; \theta)\big)}.$$

This works because the logarithm is a strictly increasing function, so by maximizing the log-likelihood we are, in essence, maximizing the likelihood.
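The log-likelihood is easy to put to work numerically. For $n$ coin tosses with $h$ heads, $\ell(\theta) = h\log\theta + (n-h)\log(1-\theta)$; evaluating it over a grid of candidate values of $\theta$ shows that the maximum sits at the sample proportion $h/n$, which is the MLE. (This sketch uses the same sequence of tosses as the example below.)

```python
# - Bernoulli log-likelihood over a grid of candidate parameter values
import numpy as np

tosses = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])
h, n = tosses.sum(), tosses.size

thetas = np.linspace(0.01, 0.99, 99)
loglik = h * np.log(thetas) + (n - h) * np.log(1 - thetas)
print(thetas[np.argmax(loglik)], h / n)
```

The grid maximum coincides with the sample proportion, 0.4: summing log-probabilities gives the same answer as multiplying probabilities, without the underflow.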

And now, the examples.

Tossing a coin¶

Imagine I have a fair coin: the one that lands Heads and Tails with equal probabilities P(H) = P(T) = 1/2.

In [ ]:
tosses = np.array([1,0,1,0,1,0,0,1,0,0])
print(tosses)
[1 0 1 0 1 0 0 1 0 0]
In [ ]:
np.unique(tosses, return_counts=True)[1]
Out[ ]:
array([6, 4])

Uhm, but I have already assumed that we would be tossing a fair coin? Four heads, six tails: how come? Is this really a fair coin? The answer is: it still might be. Let's see.

In [ ]:
outcomes = np.array([0,1])
tosses = np.random.choice(outcomes, 10, replace=True, p=[.5, .5])
print(tosses)
print(np.unique(tosses, return_counts=True)[1]/tosses.size)
[0 0 0 0 1 0 1 0 0 1]
[0.7 0.3]

What about now?

In [ ]:
outcomes = np.array([0,1])
tosses = np.random.choice(outcomes, 10, replace=True, p=[.5, .5])
print(tosses)
print(np.unique(tosses, return_counts=True)[1]/tosses.size)
[0 1 1 0 0 0 0 0 0 0]
[0.8 0.2]

Let's create 1,000 samples and check the proportions of heads and tails.

In [ ]:
outcomes = np.array([0,1])
num_exps = 1_000
stat_exp_1 = []

for i in range(num_exps):
    tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])
    result = np.array([(tosses==0).sum(), (tosses==1).sum()])/tosses.size
    result = {'P_T':result[0], 'P_H':result[1]}
    stat_exp_1.append(result)
    
stat_exp_1 = pd.DataFrame(stat_exp_1)    
stat_exp_1.head(20)
Out[ ]:
P_T P_H
0 0.47 0.53
1 0.52 0.48
2 0.41 0.59
3 0.52 0.48
4 0.58 0.42
5 0.55 0.45
6 0.60 0.40
7 0.41 0.59
8 0.48 0.52
9 0.49 0.51
10 0.62 0.38
11 0.60 0.40
12 0.41 0.59
13 0.50 0.50
14 0.51 0.49
15 0.52 0.48
16 0.43 0.57
17 0.52 0.48
18 0.59 0.41
19 0.48 0.52

Let's plot the distribution of these probabilities.

In [ ]:
# - matplotlib figure size
plt.rcParams['figure.figsize'] = [10, 4]

stat_exp_1.hist(bins=100)

sns.despine(offset=10, trim=True)

Ok, now let's see what happens with 10,000 statistical experiments with a fair coin:

In [ ]:
# - matplotlib figure size
plt.rcParams['figure.figsize'] = [10, 4]

outcomes = np.array([0,1])
num_exps = 10_000
stat_exp_1 = []

for i in range(num_exps):
    tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])
    result = np.array([(tosses==0).sum(), (tosses==1).sum()])/tosses.size
    result = {'P_T':result[0], 'P_H':result[1]}
    stat_exp_1.append(result)

stat_exp_1 = pd.DataFrame(stat_exp_1)
stat_exp_1.hist(bins=100)

sns.despine(offset=10, trim=True)

What about 100,000 experiments?

In [ ]:
# - matplotlib figure size
plt.rcParams['figure.figsize'] = [10, 4]

outcomes = np.array([0,1])
num_exps = 100_000
stat_exp_1 = []

for i in range(num_exps):
    tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])
    result = np.array([(tosses==0).sum(), (tosses==1).sum()])/tosses.size
    result = {'P_T':result[0], 'P_H':result[1]}
    stat_exp_1.append(result)
    
stat_exp_1 = pd.DataFrame(stat_exp_1)
stat_exp_1.hist(bins=100)

sns.despine(offset=10, trim=True)

Theoretical probability: when we know that some stochastic system produces an outcome with some particular probability.

Experimental probability: when we estimate that some stochastic system produces an outcome with some particular probability from observed data.

In [ ]:
stat_exp_1['P_H'].mean()
Out[ ]:
0.5001650000000001
In [ ]:
stat_exp_1['P_T'].mean()
Out[ ]:
0.499835

Reverse engineering the coin¶

In all my previous statistical experiments, I have assumed that the coin is fair, i.e. P(H) = P(T) = .5; look:

tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])

I have used numerical simulations of coin tosses to demonstrate that, with an increasing number of experiments, I estimate the known probabilities P(H) and P(T) better and better.

However, we typically do not know the theoretical probability of a random event... Example: the CTR of some social media post.

CTR (Click-Through Rate) is like a coin toss: upon each served impression, a post is either clicked or not clicked. We can thus treat the CTR like a coin toss: P(H) = P(Clicked) is "post clicked", P(T) = P(Not clicked) is "post not clicked". Then we can say that P(Clicked) is some latent, essential characteristic of the post.

Imagine we have a set of observations for a particular social media post: 1,1,1,0,0,0,1,0,1,0,1,0,0,0,1,0.., where 1 = Clicked and 0 = Not clicked. How do we estimate the CTR of the post from these data?

Easy, right: sum up the vector of observations and divide by its length. But wait. There's more to it.
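The estimate is just the sample mean. A minimal sketch, using the first 16 observations shown above (the full sequence is truncated):

```python
import numpy as np

# the first 16 observations listed above (1 = Clicked, 0 = Not clicked)
obs = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# experimental CTR estimate: number of clicks / number of impressions
ctr_estimate = obs.sum() / obs.size
print(ctr_estimate)  # 0.4375
```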

I have just invented a social media post with a CTR of .3 and, in the simulation shown below, estimated its experimental probability to be .3088; that is the best estimate that I currently have. But what if I want to test a set of hypotheses about the CTR of this post? What, for example, if I am interested in how likely the observed data would be if the CTR were .25 instead?

Let's use a simpler, smaller set of observations to begin with.

Imagine we toss a fair coin with P(H)=.5 twice and observe two Heads. The probability of observing two heads with P(H)=.5 is:

.5 * .5 = .25

But what if the coin was not fair, and instead had a probability of P(H)=.3? Then:

.3 * .3 = 0.09

And what if the coin had P(H) = .7?

.7 * .7 = 0.49

With P(H) = .9 we get:

.9 * .9 = 0.81

and it seems intuitive and logical: if we have observed a sample - however small it was - of all heads (and are assuming that we have observed two Heads from two tosses here), it makes sense that we are dealing with a coin that almost always lands Heads.
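The hand computations above can be collected in a couple of lines; for two Heads in two tosses, the probability of the data under a candidate coin is simply $P(H)^2$:

```python
# probability of observing HH for several candidate values of P(H)
likelihood_hh = {p: p * p for p in [0.5, 0.3, 0.7, 0.9]}
print(likelihood_hh)

# the candidate with the highest likelihood of producing HH
best = max(likelihood_hh, key=likelihood_hh.get)
print(best)  # 0.9
```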

Let's express this intuition visually. First, we need a function to compute the probability that a coin with a certain P(H) produced the data.

In [ ]:
outcomes = np.array([0,1])
clicks = np.random.choice(outcomes, 10000, replace=True, p=[.7, .3])
ctr = np.array([(clicks==1).sum()])/clicks.size
print(ctr)
[0.3088]

Let's define a helper function for coin tossing. It returns the probability of a single observation, given the probability of getting Heads: if we say $P(H)=.2$, then the function returns 0.2 for each Heads and 0.8 for each Tails.

In [ ]:
def parameter(observation, heads_probability):

    if observation==1:
        return heads_probability
    
    elif observation==0:
        return 1 - heads_probability
    
    else:
        return np.nan

# we are vectorizing the function
pars = np.vectorize(parameter)

# here is our sample of coin toss results
data = np.array([1, 1, 0])

# these are the probabilities of each coin toss result
print(pars(data, .25))
[0.25 0.25 0.75]

Remember that we have defined the likelihood as the product of each observation's probability? Here we define a function to do just that.

In [ ]:
def likelihood(x, loglikelihood=False):
    # x: vector of per-observation probabilities
    if loglikelihood:
        # base-2 log-likelihood; summing logs avoids numerical underflow
        return np.sum(np.log2(x))
    else:
        # likelihood: product of the individual probabilities
        return np.prod(x)
In [ ]:
print(likelihood(pars(data, .25)))
0.046875
In [ ]:
print(likelihood(pars(data, .25), loglikelihood=True))
-4.415037499278844
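Since the helper uses base-2 logarithms, the two numbers above are linked by $2^{\text{loglik}}$. Taking logs is a monotone transform, so it never changes where the maximum is; it mainly protects against numerical underflow when many small probabilities are multiplied. A quick sanity check:

```python
import numpy as np

# per-observation probabilities from the example above
probs = np.array([0.25, 0.25, 0.75])

lik = np.prod(probs)              # 0.046875
loglik2 = np.sum(np.log2(probs))  # -4.4150...
print(np.isclose(lik, 2 ** loglik2))  # True
```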

Nice. Now, let's define our parameter's domain. Remember, we are trying to find the value of $P(H)$ that maximizes the likelihood. Since it is a probability, its domain is the interval $[0, 1]$.

In [ ]:
test_parameters = np.linspace(0, 1, 1000)

# first 100 values of P(H)
print(test_parameters[0:100])
[0.         0.001001   0.002002   0.003003   0.004004   0.00500501
 0.00600601 0.00700701 0.00800801 0.00900901 0.01001001 0.01101101
 0.01201201 0.01301301 0.01401401 0.01501502 0.01601602 0.01701702
 0.01801802 0.01901902 0.02002002 0.02102102 0.02202202 0.02302302
 0.02402402 0.02502503 0.02602603 0.02702703 0.02802803 0.02902903
 0.03003003 0.03103103 0.03203203 0.03303303 0.03403403 0.03503504
 0.03603604 0.03703704 0.03803804 0.03903904 0.04004004 0.04104104
 0.04204204 0.04304304 0.04404404 0.04504505 0.04604605 0.04704705
 0.04804805 0.04904905 0.05005005 0.05105105 0.05205205 0.05305305
 0.05405405 0.05505506 0.05605606 0.05705706 0.05805806 0.05905906
 0.06006006 0.06106106 0.06206206 0.06306306 0.06406406 0.06506507
 0.06606607 0.06706707 0.06806807 0.06906907 0.07007007 0.07107107
 0.07207207 0.07307307 0.07407407 0.07507508 0.07607608 0.07707708
 0.07807808 0.07907908 0.08008008 0.08108108 0.08208208 0.08308308
 0.08408408 0.08508509 0.08608609 0.08708709 0.08808809 0.08908909
 0.09009009 0.09109109 0.09209209 0.09309309 0.09409409 0.0950951
 0.0960961  0.0970971  0.0980981  0.0990991 ]
In [ ]:
# - setup
likelihood_frame = []
outcomes = np.array([0,1])
data = np.random.choice(outcomes, 1000, replace=True, p=[.7, .3])

# - iterate
for pr in test_parameters:
    d = likelihood(pars(data, pr))
    d = {'P_H':pr, 'Likelihood':d}
    likelihood_frame.append(d)

# - compose
likelihood_frame = pd.DataFrame(likelihood_frame)

# - inspect
likelihood_frame.head(10)
Out[ ]:
P_H Likelihood
0 0.000000 0.0
1 0.001001 0.0
2 0.002002 0.0
3 0.003003 0.0
4 0.004004 0.0
5 0.005005 0.0
6 0.006006 0.0
7 0.007007 0.0
8 0.008008 0.0
9 0.009009 0.0
In [ ]:
plt.rcParams['figure.figsize'] = [6, 5]

likelihood_frame.plot(x='P_H', y='Likelihood');

sns.despine(offset=10, trim=True)

Let's ask explicitly:

In [ ]:
maximum_likelihood = likelihood_frame['Likelihood'].idxmax()

likelihood_frame['P_H'][maximum_likelihood]
Out[ ]:
0.2922922922922923

And what if P(H) was .85?

In [ ]:
# - setup
likelihood_frame = []
outcomes = np.array([0,1])
data = np.random.choice(outcomes, 1000, replace=True, p=[.15, .85])

# - iterate
for pr in test_parameters:
    d = likelihood(pars(data, pr))
    d = {'P_H': pr, 'Likelihood':d}
    likelihood_frame.append(d)

# - compose
likelihood_frame = pd.DataFrame(likelihood_frame)

# - inspect
likelihood_frame.head(10)
Out[ ]:
P_H Likelihood
0 0.000000 0.0
1 0.001001 0.0
2 0.002002 0.0
3 0.003003 0.0
4 0.004004 0.0
5 0.005005 0.0
6 0.006006 0.0
7 0.007007 0.0
8 0.008008 0.0
9 0.009009 0.0
In [ ]:
maximum_likelihood = likelihood_frame['Likelihood'].idxmax()

likelihood_frame['P_H'][maximum_likelihood]
Out[ ]:
0.8698698698698699

Let's pick an arbitrary set of observations:

In [ ]:
# - setup
likelihood_frame = []
data = np.array([1,1,0,1,1,1,1,1,0,0,0,0,1,1,0,1,1])

# - iterate
for pr in test_parameters:
    d = likelihood(pars(data, pr))
    d = {'P_H': pr, 'Likelihood':d}
    likelihood_frame.append(d)

# - compose
likelihood_frame = pd.DataFrame(likelihood_frame)

# - inspect
likelihood_frame.head(10)
Out[ ]:
P_H Likelihood
0 0.000000 0.000000e+00
1 0.001001 1.005009e-33
2 0.002002 2.045915e-30
3 0.003003 1.759043e-28
4 0.004004 4.139855e-27
5 0.005005 4.790436e-26
6 0.006006 3.537903e-25
7 0.007007 1.916616e-24
8 0.008008 8.275962e-24
9 0.009009 3.005145e-23
In [ ]:
# - matplotlib figure size
plt.rcParams['figure.figsize'] = [6, 5]

# - plot
likelihood_frame.plot(x='P_H', y='Likelihood');

sns.despine(offset=10, trim=True)
In [ ]:
maximum_likelihood = likelihood_frame['Likelihood'].idxmax()
likelihood_frame['P_H'][maximum_likelihood]
Out[ ]:
0.6466466466466466
In [ ]:
np.unique(data, return_counts=True)[1]
Out[ ]:
array([ 6, 11])
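The grid search lands right next to the closed-form maximum likelihood estimate for a Bernoulli parameter, which is simply the sample proportion $k/n$:

```python
import numpy as np

data = np.array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1])

# k / n = 11 / 17
p_hat = data.mean()
print(p_hat)  # 0.6470588..., next to the grid value 0.6466...
```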

Good. What have we learned?

To estimate the parameter of a simple statistical model from data - a coin, in our case, a simple stochastic system with two outcome states only - we need to find the maximum of its likelihood function given the observed data.
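This maximum can also be found analytically. For $n$ tosses with $k$ Heads, the log-likelihood and its maximizer are:

$$
\ell(p) = k \ln p + (n - k)\ln(1 - p), \qquad \frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \;\Longrightarrow\; \hat{p} = \frac{k}{n}.
$$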

For our minimal HH example - tossing a coin twice and observing two Heads in a row - the likelihood function is:

In [ ]:
# - setup
likelihood_frame = []
data = np.array([1,1])

# - iterate
for pr in test_parameters:
    d = likelihood(pars(data, pr))
    d = {'P_H': pr, 'Likelihood':d}
    likelihood_frame.append(d)

# - compose
likelihood_frame = pd.DataFrame(likelihood_frame)

# - inspect
likelihood_frame.head(10)

# - matplotlib figure size
plt.rcParams['figure.figsize'] = [6, 5]

# - plot
likelihood_frame.plot(x='P_H', y='Likelihood');

sns.despine(offset=10, trim=True)

Of course: however small the sample, it represents everything that we know about the coin. If we observe all Heads, as in HH, then the coin most likely to have produced that result is the one that yields Heads every time!

2.1 MLE in Linear Regression¶

Let's use the Iris data set for another example.

In [ ]:
iris = pd.read_csv(os.path.join(data_dir, 'iris.csv'))
iris.head(10)
Out[ ]:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
In [ ]:
linear_model = smf.ols(formula='petal_length ~ sepal_length', data=iris).fit()
linear_model.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: petal_length R-squared: 0.760
Model: OLS Adj. R-squared: 0.758
Method: Least Squares F-statistic: 468.6
Date: Fri, 28 Apr 2023 Prob (F-statistic): 1.04e-47
Time: 15:53:16 Log-Likelihood: -190.49
No. Observations: 150 AIC: 385.0
Df Residuals: 148 BIC: 391.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -7.0954 0.506 -14.011 0.000 -8.096 -6.095
sepal_length 1.8575 0.086 21.646 0.000 1.688 2.027
Omnibus: 0.255 Durbin-Watson: 1.204
Prob(Omnibus): 0.880 Jarque-Bera (JB): 0.384
Skew: -0.084 Prob(JB): 0.825
Kurtosis: 2.817 Cond. No. 43.4


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]:
iris['Prediction'] = linear_model.predict()
iris['Residuals'] = linear_model.resid
iris.head(10)
Out[ ]:
sepal_length sepal_width petal_length petal_width class Prediction Residuals
0 5.1 3.5 1.4 0.2 setosa 2.377918 -0.977918
1 4.9 3.0 1.4 0.2 setosa 2.006416 -0.606416
2 4.7 3.2 1.3 0.2 setosa 1.634914 -0.334914
3 4.6 3.1 1.5 0.2 setosa 1.449163 0.050837
4 5.0 3.6 1.4 0.2 setosa 2.192167 -0.792167
5 5.4 3.9 1.7 0.4 setosa 2.935171 -1.235171
6 4.6 3.4 1.4 0.3 setosa 1.449163 -0.049163
7 5.0 3.4 1.5 0.2 setosa 2.192167 -0.692167
8 4.4 2.9 1.4 0.2 setosa 1.077661 0.322339
9 4.9 3.1 1.5 0.1 setosa 2.006416 -0.506416

Let's plot petal length as a function of sepal length.

In [ ]:
# - plotting the true data, predicted values and the prediction line
sns.regplot(data=iris, x='sepal_length', y='petal_length', ci=0, line_kws={'color':'red'})
sns.scatterplot(data=iris, x='sepal_length', y='Prediction', color='red', s=50)
sns.despine(offset=10, trim=True)
plt.title('petal_length ~ sepal_length', fontsize=14);
In [ ]:
print("Pearson's correlation (R-value): " + str(round(np.sqrt(linear_model.rsquared), 2)))
print("Coefficient of determination (R^2): " + str(round(linear_model.rsquared, 2)))
Pearson's correlation (R-value): 0.87
Coefficient of determination (R^2): 0.76

How about we use optimization to get the model's parameters?

In [ ]:
# - normal distribution
from scipy.stats import norm
from scipy.optimize import minimize

# - nll function
def lg_nll(pars, data):
    # - pick up the parameters
    beta_0 = pars[0]
    beta_1 = pars[1]
    sigma = np.abs(pars[2])
    # - predict from parameters
    preds = beta_0+beta_1*data['sepal_length']
    # - compute residuals
    residuals = data['petal_length']-preds
    # - negative log-likelihood
    nll = -np.sum(norm.logpdf(residuals, loc=0, scale=sigma))
    # - out:
    return nll

# - initial (random) parameter values
init_beta_0 = np.random.uniform(low=-15, high=15)
init_beta_1 = np.random.uniform(low=-15, high=15)
init_sigma = np.random.uniform(low=.1, high=10)
init_pars = [init_beta_0, init_beta_1, init_sigma]

# - optimize w. Nelder-Mead
optimal_model = minimize(
    # - fun(parameters, args)
    fun=lg_nll,
    args=(iris,),
    x0 = init_pars, 
    method='Nelder-Mead',
    options={'maxiter':10e6, 
            'maxfev':10e6,
            'fatol':10e-12})

# - optimal parameters
optimal_model.x
Out[ ]:
array([-7.09538182,  1.85750971,  0.86158005])
In [ ]:
# Check against `statsmodels`
linear_model.params
Out[ ]:
Intercept      -7.095381
sepal_length    1.857510
dtype: float64
In [ ]:
# Log-Likelihood
loglike = -lg_nll(optimal_model.x, data=iris)
print(loglike)
-190.4926826522033
In [ ]:
linear_model.llf
Out[ ]:
-190.49268265220212
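A note on the third fitted parameter: the optimizer returns the maximum likelihood estimate of $\sigma$, which divides the sum of squared residuals by $n$, while OLS software reports the unbiased version that divides by $n - 2$. A standalone sketch on synthetic data (all names and values here are made up for illustration):

```python
import numpy as np

# synthetic data roughly resembling the iris fit above
rng = np.random.default_rng(513)
x = rng.uniform(4, 8, size=150)
y = -7.1 + 1.86 * x + rng.normal(0, 0.86, size=150)

# ordinary least squares via a degree-1 polynomial fit
beta_1, beta_0 = np.polyfit(x, y, 1)
resid = y - (beta_0 + beta_1 * x)

sigma_mle = np.sqrt(np.mean(resid ** 2))               # what the optimizer returns
sigma_ols = np.sqrt(np.sum(resid ** 2) / (150 - 2))    # what OLS reports
print(sigma_mle, sigma_ols)
```

For n = 150 the two differ only by a factor of sqrt(150/148), which is why the Nelder-Mead result agrees so closely with the `statsmodels` fit.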
In [ ]:
beta_0_vals = np.linspace(-10,0,300)
beta_1_vals = np.linspace(0,3,300)

sigma = optimal_model.x[2]

grid = np.array([(beta_0, beta_1) for beta_0 in beta_0_vals for beta_1 in beta_1_vals])

grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_0", 1: "beta_1"})

nll = []

for i in range(grid.shape[0]):
    pars = [grid['beta_0'][i],grid['beta_1'][i],sigma]
    nll.append(lg_nll(pars, iris))

grid['likelihood'] = nll
grid['likelihood'] = np.exp(-grid['likelihood'])
grid.sort_values('likelihood', ascending=False, inplace=True)
grid.head()
Out[ ]:
beta_0 beta_1 likelihood
26285 -7.090301 1.856187 1.860885e-83
26884 -7.023411 1.846154 1.840102e-83
25686 -7.157191 1.866221 1.830589e-83
25387 -7.190635 1.876254 1.780881e-83
27483 -6.956522 1.836120 1.769939e-83
In [ ]:
# - import plotly
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
    x=grid['beta_0'], 
    y=grid['beta_1'], 
    z=grid['likelihood'], 
    color='red', 
    opacity=0.50)])
fig.update_layout(scene = dict(
                    xaxis_title='Beta_0',
                    yaxis_title='Beta_1',
                    zaxis_title='Likelihood'),
                    width=700,
                    margin=dict(r=20, b=10, l=10, t=10))
fig.show()

Further Reading¶

  • StatQuest w. Josh Starmer, Regularization Part 1: Ridge (L2) Regression
  • StatQuest w. Josh Starmer, Regularization Part 2: LASSO (L1) Regression
  • StatQuest w. Josh Starmer, Regularization Part 3: Elastic Net Regression
  • Khan Academy, Algebra 2, Unit: Logarithms
  • StatQuest w. Josh Starmer, Maximum Likelihood, clearly explained!!!
  • Timothy Schulz, Introduction to Maximum Likelihood Estimation

DataKolektiv, 2022/23.

hello@datakolektiv.com

License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.